This is Federal Election Commission (FEC) data for the 2016 presidential race, specifically the contributions from the state of NY. The data was last updated on 21-April-2016.

The dataset can be found here: http://fec.gov/disclosurep/PDownload.do [It’s the NY.zip file]

Before getting started, this is an important link! ftp://ftp.fec.gov/FEC/Presidential_Map/2016/DATA_DICTIONARIES/CONTRIBUTOR_FORMAT.txt

It is the data dictionary, which explains what each data element means!

Let’s load the data & look at all the players

NOTE: In order for read.csv to parse this properly, I needed to append an extra comma at the end of the header row. Without the extra comma, the header has one fewer field than the data rows, so read.csv treats the first column as row names and fails with a ‘duplicate row.names’ error.
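Here is a minimal sketch of that header fix on a toy stand-in for the file. It assumes each data row ends with a trailing comma; the three column names are my reading of the data dictionary and the values are made up.

```r
# Toy stand-in for the FEC file (made-up values). Each data row ends with a
# trailing comma, so the rows have one more field than the header.
raw <- c(
  "cand_nm,contbr_zip,contb_receipt_amt",
  "\"Sanders, Bernard\",100011234,27,",
  "\"Sanders, Bernard\",112011234,50,"
)

# Append a comma to the header so header and rows agree on the field count.
raw[1] <- paste0(raw[1], ",")

contribs <- read.csv(text = raw, stringsAsFactors = FALSE)

# The extra comma creates an empty last column; drop it.
contribs <- contribs[, -ncol(contribs)]
```

With the real file, the same edit can be made once in a text editor instead of in code.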

##  [1] Sanders, Bernard          Cruz, Rafael Edward 'Ted'
##  [3] Walker, Scott             Bush, Jeb                
##  [5] Stein, Jill               Rubio, Marco             
##  [7] Christie, Christopher J.  Clinton, Hillary Rodham  
##  [9] Johnson, Gary             Graham, Lindsey O.       
## [11] Trump, Donald J.          Carson, Benjamin S.      
## [13] Paul, Rand                Kasich, John R.          
## [15] Fiorina, Carly            Santorum, Richard J.     
## [17] Jindal, Bobby             Huckabee, Mike           
## [19] O'Malley, Martin Joseph   Pataki, George E.        
## [21] Gilmore, James S IIII     Lessig, Lawrence         
## [23] Webb, James Henry Jr.     Perry, James R. (Rick)   
## 24 Levels: Bush, Jeb Carson, Benjamin S. ... Webb, James Henry Jr.

Interesting! Who is Jill Stein?? Also Gary Johnson and James Gilmore…

Going to get rid of some of the fringe players and less popular candidates

Convert the zip into a factor and remove the extra +4 digits. Also save the first 3 digits of the zip separately, as that is useful geographic information [it denotes an SCF: Sectional Center Facility]
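A small sketch of the zip clean-up, using made-up zip values and hypothetical variable names. It treats the zip as text so that no digits are lost.

```r
# Made-up zips: some carry the +4 suffix, some do not.
zip_raw <- c("100011234", "10128", "115501234")

# Keep the 5-digit zip, then take its first 3 digits as the SCF.
zip5 <- substr(as.character(zip_raw), 1, 5)
scf  <- substr(zip5, 1, 3)

# Store the 5-digit zip as a factor.
contbr_zip <- factor(zip5)
```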

Obtain extra info on the zip codes from 2010 Census data, then group it by SCF to get the total population of each SCF
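Roughly, the roll-up from zip to SCF looks like this. The populations below are invented for illustration and the column names are hypothetical, not the Census file's own.

```r
# Invented per-zip populations (not real Census figures).
census <- data.frame(
  zip = c("10001", "10002", "10128", "10162"),
  pop = c(21000, 81000, 60000, 1700),
  stringsAsFactors = FALSE
)

# Derive the SCF and sum the population within each SCF.
census$scf <- substr(census$zip, 1, 3)
scf_pop <- aggregate(pop ~ scf, data = census, FUN = sum)
```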

Convert dates & get rid of some unnecessary fields
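A sketch of the date conversion, assuming the receipt dates arrive as text like "11-OCT-13" (that format, and the column names, are my assumption about the file). Parsing %b relies on English month abbreviations in the current locale. memo_cd is only an example of a field one might drop.

```r
# Toy rows; the date format and column names are assumptions.
contribs <- data.frame(
  contb_receipt_dt = c("11-OCT-13", "29-FEB-16", "31-MAR-16"),
  memo_cd = c("", "X", ""),
  stringsAsFactors = FALSE
)

# Day, abbreviated month name, 2-digit year.
contribs$contb_receipt_dt <- as.Date(contribs$contb_receipt_dt, format = "%d-%b-%y")

# Drop a field that is not needed for the analysis.
contribs$memo_cd <- NULL
```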

Supplementing the data set with additional attributes of the candidates including their party, gender, and dates they dropped out of the campaign
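One way to attach those attributes is a small hand-built lookup table merged on the candidate name. This is a sketch with toy contributions and hypothetical column names; the drop-out date shown for Scott Walker is from public reporting and worth double-checking.

```r
# Toy contributions.
contribs <- data.frame(
  cand_nm = c("Sanders, Bernard", "Walker, Scott", "Sanders, Bernard"),
  amt = c(27, 500, 50),
  stringsAsFactors = FALSE
)

# Hand-built candidate attributes; NA means still in the race.
cand_info <- data.frame(
  cand_nm = c("Sanders, Bernard", "Walker, Scott"),
  party = c("Democrat", "Republican"),
  gender = c("M", "M"),
  dropout_dt = as.Date(c(NA, "2015-09-21")),
  stringsAsFactors = FALSE
)

# Left join: keep every contribution, add the candidate's attributes.
contribs <- merge(contribs, cand_info, by = "cand_nm", all.x = TRUE)
```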

Need to use geom_bar here; geom_histogram does not work because the candidate is a discrete (categorical) variable, not continuous data.
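A minimal ggplot2 sketch of that kind of count plot, on toy rows with a hypothetical cand_nm column. geom_bar counts the rows per category by default, and the theme call tilts the x-axis labels so the names stay readable.

```r
library(ggplot2)

# Toy data: one row per contribution.
contribs <- data.frame(
  cand_nm = c("Sanders, Bernard", "Sanders, Bernard",
              "Clinton, Hillary Rodham", "Trump, Donald J."),
  stringsAsFactors = FALSE
)

# geom_bar() counts rows per candidate; no binning is involved.
p <- ggplot(contribs, aes(x = cand_nm)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```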


Bernie Sanders & Hillary Clinton have by far the most contributions.

##         Min.      1st Qu.       Median         Mean      3rd Qu. 
## "2013-10-11" "2015-12-07" "2016-02-11" "2016-01-10" "2016-03-09" 
##         Max. 
## "2016-03-31"

Interesting! The first contribution date is back in 2013! To whom, by whom? Why so early?


Answer: Marco Rubio!

How about the other candidates before 2015? Who believed themselves destined for greatness so early on?

Makes sense that the # of contributions is steadily climbing and hits a maximum on the latest date we have

Let’s look at how much cash came in each year

2013 & 2014 barely register. 2015 stands tall, but 2016 has almost caught up (and this data is only 3 months into the year!)

What are the actual amounts?

## Source: local data frame [4 x 3]
## 
##      dt      amt      n
##   (chr)    (dbl)  (int)
## 1  2013        0      2
## 2  2014     9950      8
## 3  2015 29162407  58193
## 4  2016 23591108 125318
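A summary of this shape (total amount and number of contributions per year) can be sketched in base R as below. The rows are toy data and the column names simply mirror the output above; the output format suggests the original used dplyr.

```r
# Toy contributions.
contribs <- data.frame(
  date = as.Date(c("2014-06-01", "2015-07-15", "2015-12-31", "2016-02-29")),
  amt = c(100, 250, 50, 27)
)

# Year as text, then the sum and the count per year.
contribs$dt <- format(contribs$date, "%Y")
totals <- aggregate(amt ~ dt, data = contribs, FUN = sum)
counts <- aggregate(amt ~ dt, data = contribs, FUN = length)
names(counts)[2] <- "n"

by_year <- merge(totals, counts, by = "dt")
```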

You can also see that there are already far more contributions in the first 3 months of 2016 than in 2013, 2014 and 2015 combined. The Power of an Election Year!

The pre-2015 amount is minuscule compared to the action in 2015. Further justification to drop the pre-2015 data.

Going to eliminate some data from the data set because I believe the records are outliers or may involve some arcane political work-arounds:

  1. Eliminate anything prior to April 1, 2015, because I don’t believe most candidates had fully started their campaigns before then

  2. Eliminate any election_tp codes other than “P2016” [so the focus is only on Primary 2016 contributions]

Let’s look at the $ by election type:

##      tp           x
## 1         -53770.84
## 2 G2016   981517.92
## 3 P2016 51632589.11
## 4 P2020     7700.00
##         G2016  P2016  P2020 
##     27   1398 181921      3
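The two outputs above are a sum of dollars per election type and a count of rows per election type. A base R sketch on toy rows (made-up amounts; election_tp is the field name used in this write-up):

```r
# Toy rows, including a blank election type with a negative amount.
contribs <- data.frame(
  election_tp = c("P2016", "P2016", "G2016", "", "P2020"),
  amt = c(27, 2700, 2700, -500, 2700),
  stringsAsFactors = FALSE
)

# Dollars per election type.
amt_by_tp <- aggregate(amt ~ election_tp, data = contribs, FUN = sum)

# Number of contributions per election type.
n_by_tp <- table(contribs$election_tp)
```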

The blank election type nets out to -$53,770.84, from 27 observations.

G2016 has about $1M, from 1,398 observations.

And people are even giving to the P2020 election cycle! (Upon further review, there were only 3 such contributions and they were noted as “REDESIGNATIONS”. Interestingly, they were all to Lindsey Graham and from prominent NYers, 2 of whom are married to each other! So my guess is Lindsey had some sort of party in NY, perhaps?)

Let’s nix the above and concentrate only on the P2016 data

So now we’re looking only at contributions made on or after April 1, 2015 and for the P2016 cycle
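The two filters together, sketched on toy rows with hypothetical column names (a date column and election_tp):

```r
# Toy rows covering each case.
contribs <- data.frame(
  date = as.Date(c("2015-03-15", "2015-04-01", "2016-01-10",
                   "2016-02-29", "2016-03-01")),
  election_tp = c("P2016", "P2016", "G2016", "P2016", ""),
  amt = c(100, 250, 2700, 27, -50),
  stringsAsFactors = FALSE
)

# Keep contributions dated April 1, 2015 or later AND tagged P2016.
keep <- contribs$date >= as.Date("2015-04-01") & contribs$election_tp == "P2016"
p2016 <- contribs[keep, ]
```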

Let’s again plot the contributions

Still shows that Hillary Clinton and Bernie Sanders have received the most contributions.

The Dems have definitely outraised the Republicans.

Let’s look now at the amounts that were raised

##                           tp          x
## 4    Clinton, Hillary Rodham 35831297.0
## 9           Sanders, Bernard  5654041.6
## 1                  Bush, Jeb  3610144.3
## 8               Rubio, Marco  2499135.3
## 5  Cruz, Rafael Edward 'Ted'  1188737.9
## 3   Christie, Christopher J.   858087.0
## 7            Kasich, John R.   676563.5
## 2        Carson, Benjamin S.   635654.4
## 6         Graham, Lindsey O.   254572.1
## 11             Walker, Scott   220606.0
## 10          Trump, Donald J.   203750.0

This is an interesting perspective. If both the mean and median are high for a candidate (e.g. see Chris Christie & Jeb Bush), along with a small # of donors, this indicates that a concentrated group of people backed the campaign with outsized contributions.

Meanwhile, Bernie is interesting in that he got both small median & mean contributions, but because he had so many contributions, he has the 2nd biggest haul (after Hillary). He really is being powered by the many. Hillary has the edge, however, because her mean contributions are larger and she also has quite a number of people contributing.
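To make the count / mean / median comparison concrete, here is a sketch on invented amounts: one candidate with many small contributions, one with a few large ones. Column names are hypothetical.

```r
# Invented amounts: many small vs. few large contributions.
contribs <- data.frame(
  cand_nm = c(rep("Sanders, Bernard", 5), rep("Bush, Jeb", 2)),
  amt = c(27, 27, 50, 10, 15, 2700, 2700),
  stringsAsFactors = FALSE
)

# Count, total, mean and median per candidate.
by_cand <- split(contribs$amt, contribs$cand_nm)
summary_tbl <- data.frame(
  cand_nm = names(by_cand),
  n = sapply(by_cand, length),
  total = sapply(by_cand, sum),
  mean = sapply(by_cand, mean),
  median = sapply(by_cand, median),
  row.names = NULL,
  stringsAsFactors = FALSE
)
```

A high mean and median with a small n points to a few large donors; a low mean and median with a large n points to many small ones.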

Look at zip codes: which one contributed most and to whom? As a reminder, we are looking at the first 3 digits which constitute an SCF

Very Interesting! The contributions from 2 areas look vastly higher than any others.

Looking at the data, “100” & “101” greatly surpass any other areas!


The above shows Hillary has a commanding lead even in the wealthiest zip codes

Let’s look at the average contribution per SCF

Definitely 101 takes the cake for average contribution per capita! Although 100 & 101 have near similar total contributions, 101 has far fewer people living there.

Actually, based on one reviewer’s feedback, there was a suggestion to plot it on a log scale to really be able to compare. This is what is done below and is quite fascinating


You can see a large # of bars are below 1.00 and some even below .01! This basically means that very little was contributed per resident in those areas. Especially in areas with large numbers of people, those people are on the whole not contributing to the campaigns!
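The per-capita figure and its log-scale version boil down to the arithmetic below. The dollar totals and populations are invented, chosen only so the values span several orders of magnitude.

```r
# Invented totals and populations per SCF.
scf_amt <- data.frame(scf = c("100", "101", "112"),
                      amt = c(1000000, 900000, 5000),
                      stringsAsFactors = FALSE)
scf_pop <- data.frame(scf = c("100", "101", "112"),
                      pop = c(1000000, 9000, 500000),
                      stringsAsFactors = FALSE)

# Dollars contributed per resident of each SCF.
per_cap <- merge(scf_amt, scf_pop, by = "scf")
per_cap$per_capita <- per_cap$amt / per_cap$pop

# On a log10 scale, values below $1 per resident become negative.
per_cap$log10_per_capita <- log10(per_cap$per_capita)
```

In ggplot2 the same effect comes from adding scale_y_log10() to the plot rather than transforming the column by hand.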

Now, let’s compare ‘avg contribution per capita’ vs. ‘avg contribution per contributor’

Wow, even though 100 & 101 had the most contributions,

Far & away, Hillary Clinton is getting the most bucks.

Just for fun, I’m going to look at anyone who’s an ACTOR

All the actors have given to the Democratic side. Not a single actor has contributed to a Republican candidate!
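One caution when searching a free-text occupation field: a plain substring match for ACTOR also hits occupations such as CONTRACTOR. A sketch with made-up occupation strings, using word boundaries to avoid that:

```r
# Made-up occupation strings.
occupation <- c("ACTOR", "CONTRACTOR", "ACTOR/WRITER", "ACTRESS", "REAL ESTATE")

# Plain substring match: also catches CONTRACTOR.
loose <- grepl("ACTOR", occupation, fixed = TRUE)

# Word-boundary match: ACTOR only as a whole word.
is_actor <- grepl("\\bACTOR\\b", occupation, perl = TRUE)
```

Whether ACTRESS should count as well is a judgment call; the pattern above leaves it out.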

Now, we will look at contributions on a daily basis for each candidate


Hillary has some huge spikes! Looking at the top couple of data points, though, indicates the reason: 31-March, 29-Feb, 31-Dec. To anyone who subscribes to candidates’ mailing lists, this is obvious: there are always HUGE drives to solicit $$ at the end of the month. But Hillary’s machine is way stronger than everyone else’s.
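A quick way to confirm that the spike dates are month-ends is to check whether the following day is the 1st. A small sketch:

```r
# The three spike dates, plus one mid-month date for contrast.
d <- as.Date(c("2015-12-31", "2016-02-29", "2016-03-15", "2016-03-31"))

# A date is a month-end if the next day is the 1st of a month.
is_month_end <- format(d + 1, "%d") == "01"
```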

Now, let’s look at each of the candidates’ hauls over time; this shows the cumulative sum over time
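The daily totals and the running total per candidate can be sketched like this, on toy rows with hypothetical column names:

```r
# Toy contributions.
contribs <- data.frame(
  cand_nm = c("Clinton", "Clinton", "Clinton", "Sanders", "Sanders"),
  date = as.Date(c("2016-01-01", "2016-01-01", "2016-01-02",
                   "2016-01-01", "2016-01-03")),
  amt = c(10, 20, 5, 100, 50),
  stringsAsFactors = FALSE
)

# Sum per candidate per day.
daily <- aggregate(amt ~ cand_nm + date, data = contribs, FUN = sum)

# Order by candidate and date, then take a running sum within each candidate.
daily <- daily[order(daily$cand_nm, daily$date), ]
daily$cum_amt <- ave(daily$amt, daily$cand_nm, FUN = cumsum)
```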

Hillary’s climb soars over everyone else’s

Now let’s look at the same data but not on a cumulative basis and only at Republicans


This graph is messy. Would really need to select a subset of candidates to filter on. But I will move on from here.

Let’s look at when the first and last contributions were made to each candidate, with a vertical line marking Jan-01-2016 and, additionally, the drop-out date of each candidate

Interesting. Candidates were getting money even AFTER they dropped out! Scott Walker & Lindsey Graham dropped out in 2015 and they’re still getting contributions!

Let’s take a look at the amounts raised before & after each candidate’s drop-out date
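A sketch of the before / after split on toy contributions. The column names are hypothetical and the amounts invented; the two drop-out dates are from public reporting and worth double-checking.

```r
# Toy contributions to two candidates who dropped out in 2015.
contribs <- data.frame(
  cand_nm = c("Walker, Scott", "Walker, Scott",
              "Graham, Lindsey O.", "Graham, Lindsey O."),
  date = as.Date(c("2015-08-01", "2015-11-15", "2015-10-01", "2016-01-10")),
  amt = c(500, 250, 1000, 100),
  stringsAsFactors = FALSE
)

# Drop-out dates (assumed from public reporting).
dropout <- data.frame(
  cand_nm = c("Walker, Scott", "Graham, Lindsey O."),
  dropout_dt = as.Date(c("2015-09-21", "2015-12-21")),
  stringsAsFactors = FALSE
)

# Label each contribution relative to the candidate's drop-out date, then sum.
x <- merge(contribs, dropout, by = "cand_nm")
x$period <- ifelse(x$date > x$dropout_dt, "after", "before")
totals <- aggregate(amt ~ cand_nm + period, data = x, FUN = sum)
```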

EXTRAS

A few additional things I was motivated to do after the first Udacity review

EXTRA 1

This was in the NYTimes on 29-May-2016.

http://www.nytimes.com/2016/05/29/business/they-tilt-right-but-top-chief-executives-dont-give-to-trump.html?smid=nytcore-iphone-share&smprod=nytcore-iphone

‘An analysis of political donation from chief executives shows broad support for Republican candidates. Except for the presumptive nominee.’

Seems like an ideal thing for me to cross-verify!

Interesting. This seems to contradict the NYT article. Hillary Clinton is once again by far the biggest recipient, even among all the top people [CEOs, C-Officers, Presidents]!

A few things at play here:

  1. NY is a very blue state; I think this is probably the #1 explanation for why Hillary has gotten more $. The article uses data gathered from the Center for Responsive Politics which I’m sure used all the states’ data combined.

  2. A lot of the donors in the article contributed via PACs and organized groups. The dataset I am using only consists of individual contributors, so this is not an apples-to-apples comparison.

  3. My capturing of the Chief* titles via the grep command may have grabbed other people that aren’t actually top execs so that could have skewed the results; Also lots of people like using inflated titles or might own small companies, so this analysis probably caught a lot of people who aren’t titans of industry.

  4. I also grepped for the names mentioned in the article. The only person of significance that popped up was Wendi Murdoch, the ex-wife of Rupert Murdoch, and she gave money to Hillary!

EXTRA 2

Using Maps! [As suggested by a Udacity Reviewer, I thought I’d give this a go] Used http://www.computerworld.com/article/3038270/data-analytics/create-maps-in-r-in-10-fairly-easy-steps.html

I’m going to focus in on data in NYC for visualization purposes.

Needed to download the ZipCode Tabulation file from here: https://www.census.gov/geo/maps-data/data/cbf/cbf_zcta.html

## [1] 1905

Interesting. There are 116 contributor zip codes that don’t have any geographic information. This amounts to 1905 contributions that can’t be mapped without more information. [e.g. 10158, 10104, etc.]

It turns out, after further research, that some zip codes used by contributors are not ‘official’ zip codes; some of them are subsumed by other USPS codes. See http://newyork.hometownlocator.com/zip-codes/data,zipcode,10104.cfm for an example: 10104 is contained within 10019

But I won’t worry about these. In fact, if I do:

## [1] "11695"

This shows that all but one zip code in the NYC zipcode data [in the Far Rockaways] is in the geographic data, so I should be pretty good here.

And interestingly,

## NULL
## [1] "11359" "11695"

you can see that there are only 2 zip codes in NYC that did not make any contributions at all [both in Queens]
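Comparisons like these are set differences between lists of zip codes. A sketch with toy lists: the zip values are borrowed from the text above, but which list each one belongs to here is made up for illustration, and the variable names are hypothetical.

```r
# Toy lists: zips seen on contributions vs. zips present in the map file.
contrib_zips <- c("10019", "10104", "10158", "11201", "10104")
shape_zips   <- c("10019", "11201", "11359", "11695")

# Contributor zips with no geography to draw.
unmapped <- setdiff(contrib_zips, shape_zips)

# Number of contributions that therefore cannot be mapped.
n_unmapped <- sum(contrib_zips %in% unmapped)

# Mapped zips from which no contribution was made.
no_contrib <- setdiff(shape_zips, contrib_zips)
```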


Yup! That looks like all five boroughs!

Nice static visualization of where the money is coming from. That’s it for now!

Final Plots and Summary

First Plot

Description One

This plot compares the sum of contributions made to presidential candidates per 3-digit zip prefix, i.e. per SCF (independent of candidate). Basically, it shows how much money the population in each SCF across NY State gave to the candidates. I chose this plot because it is a revelatory depiction of the disparity in contribution amounts across NY State; there is an overwhelming concentration of money coming from 2 SCFs: 100 & 101. None of the other SCFs in New York State come even close to matching the contributions made from just these two. Furthermore, these two SCFs are based in Manhattan, and cover only a subset of Manhattan at that. While it wasn’t a surprise that Manhattan had the largest dollar amount in contributions, it was surprising to see how concentrated it was. There are a couple of weird zip prefixes such as [-11, 000, 003, 011], but since the amounts barely registered, I did not pursue any further analysis on them.

Second Plot

Description Two

This plot depicts the earliest and latest dates on which each candidate received contributions (the latest possible date being 31-March-2016, as that is where this particular dataset ends). One can see that Marco Rubio started collecting contributions as far back as 2013. The plot also shows all the candidates still receiving contributions even on the last possible day. While that makes sense for candidates who are still in the running (Hillary Clinton, Bernard Sanders, Donald Trump), it seems odd for candidates who have dropped out. I picked this chart as one of my final three because of this oddity, i.e., it reveals that candidates are still collecting money even after they dropped out! For a short period of time after they drop out, collecting seems plausible because perhaps there is a pipeline of cash to be deposited, but Scott Walker and Lindsey Graham dropped out in 2015 yet are still collecting money! It definitely seems like something that requires more investigation.

Third Plot

Description Three

This graph shows the sum of contributions on a daily basis for all the candidates, from the inception of their campaigns until the last date in the dataset (31-March-2016). What I found interesting in this plot is the huge spikes that Hillary Clinton exhibits at the ends of the months. The spikes tower over everyone else’s and indicate how strong her fundraising apparatus is (either that or people are just enamored with her). The size of the spikes is one thing; their timing, at the ends of the months, is another. While this is definitely an interesting phenomenon, I think it’s readily explained by the huge end-of-month drives that campaigns make to meet FEC monthly reporting periods. Anyone who subscribes to political emails will have been subject to these.

Reflection

The presidential campaign data set for New York from the FEC contained more than 183,000 contributions ranging from 2013 until the end of March 2016. It was interesting to see that candidates could solicit money from very early on (even after Obama was re-elected in 2012). However, I do have a faint recollection that a candidate couldn’t go into real fundraising mode until after a declaration of candidacy. I recall Jeb Bush somehow building up a significant war chest as he was ‘exploring a bid’ but he hadn’t yet declared his candidacy. More research into this would be required to make sense of this data and perhaps to draw sharper distinctions.

Further areas of analysis:

  1. Break down which zips within the 100 & 101 SCF regions had the most contributions to further examine the concentration.

  2. The occupations of various contributors are free-text and could be anything. A lot of work can be done here to consolidate categories. For example, there was “Attorney” and “Attorney” [with a blank space] as well as “Lawyer”. These records could all be merged. Unfortunately the largest set of contributions came from an occupation of “”. Who knows what these people do for a living?
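As a starting point for that consolidation, here is a sketch that trims whitespace, normalizes case, merges one synonym, and marks blanks as missing. The strings are made-up examples of the variants described above.

```r
# Made-up variants of the same occupation, plus a blank.
occ <- c("Attorney", "Attorney ", "Lawyer", "", "attorney")

# Trim surrounding whitespace and normalize case.
occ_clean <- toupper(trimws(occ))

# Merge a synonym into one category and treat blanks as missing.
occ_clean[occ_clean == "LAWYER"] <- "ATTORNEY"
occ_clean[occ_clean == ""] <- NA

occ_counts <- table(occ_clean, useNA = "ifany")
```

A real clean-up would need a much longer synonym list, built by inspecting the most frequent values first.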

Resources

fix for the ‘duplicate row.names’ error when reading the data with read.csv

http://stackoverflow.com/questions/13239639/duplicate-row-names-error-reading-table-row-names-null-shifts-columns/22408965#22408965

how to tilt the x-axis labels so the candidate names can be read more clearly

http://stackoverflow.com/questions/15951216/too-many-factors-on-x-axis

help with refactoring after subsetting data

http://stackoverflow.com/questions/27296310/refactor-whole-data-frame

Information on significance of 3 digit zip code so treating this as 1 group

Sectional Center Facility [SCF]

http://www.zipboundary.com/zipcode_faqs.html